Abstract. Missing link prediction in networks is of both theoretical interest and practical significance in
modern science. In this paper, we empirically investigate a simple framework of link prediction on the basis
of node similarity. We compare nine well-known local similarity measures on six real networks. The results
indicate that the simplest measure, namely Common Neighbors, has the best overall performance, and
the Adamic-Adar index performs the second best. A new similarity measure, motivated by the resource
allocation process taking place on networks, is proposed and shown to have higher prediction accuracy
than common neighbors. It is found that many links are assigned the same scores if only the information of
the nearest neighbors is used. We therefore design another new measure exploiting information of the next
nearest neighbors, which can remarkably enhance the prediction accuracy.

PACS. 89.75.-k Complex systems - 05.65.+b Self-organized systems 

1 Introduction

Many social, biological, and information systems can be properly described as networks with nodes
representing individuals or organizations and edges mimicking the interactions among them. The study
of complex networks has attracted increasing attention and become a common focus of many branches of
science. Many efforts have been made to understand the evolution of networks [1,2], the relations
between topologies and functions [3,4], and the network characteristics [5]. Very recently, a fresh
question is raised [6], that is, how to predict missing links of networks? For some networks,
especially the biological networks such as protein-protein interaction networks, metabolic networks
and food webs, the discovery of links (i.e., interactions) costs much in the laboratory or the
field, and thus the current knowledge of those networks is substantially incomplete [7,8]. Instead of
blindly checking all possible interactions, predicting in advance based on the interactions already
known and focusing on those links most likely to exist can sharply reduce the experimental costs if
the predictions are accurate enough. For some others, like the friendship networks in web society
[9,10], very likely but not yet existing links can be suggested to the relevant users as
recommendations of promising friendships, which can help users find new friends and thus enhance
their loyalties to web sites.

The majority of the previous works on missing link prediction have used some external information
besides the network topology [11]. Craven et al. [12] predicted the semantic relationships of the
world wide web with the help of web content. Popescul and Ungar [13] designed a regression model to
predict citations made in scientific literature based on not only the citation graph, but also the
authorship, journal information and content. Taskar et al. [14] applied the relational Markov network
algorithm to predict missing links in a network of web pages and a social network, in which the
well-defined attributes of each node are exploited. O'Madadhain et al. [15] constructed local
conditional probability models for link prediction, based on both structural features and nodes'
attributes. The usage of external information can somewhat enhance the algorithmic accuracy; however,
the content and attribute information are generally not available, thus the applications of the above
algorithms are strongly limited. Goldberg and Roth [16] exploited the neighborhood cohesiveness
property of small-world networks to assess confidence for individual protein-protein interactions.
Liben-Nowell and Kleinberg [17] empirically investigated the similarity-based prediction algorithms
for large scientific collaboration networks. Clauset et al. [18] designed a prediction algorithm
based on the inherent hierarchical organization of social and biological networks.

These mentioned works are practically successful in dealing with specific networks; however, thus
far, a comprehensive picture about the dependence of algorithmic performance on network topology is
lacking. The reason is twofold: (i) the works from engineering and biological communities have not
yet caught up with the current state of development in characterizing the topologies of complex
networks [5], while (ii) the physics community has not paid enough attention to the link prediction
problem. Accordingly, dozens of important issues are still less explored. For example, one may be
concerned about how to choose a suitable algorithm given some structural descriptions of a network,
such as small-world phenomenon [19], degree heterogeneity [20], mixing pattern [21], community
structure [22], and so on. From the opposite viewpoint, comparison of the performances of some
prediction algorithms may reveal some structural information of the networks. It is just like how
the community structure has a significant effect on the network synchronizability [23], while the
synchronizing process can be properly used to reveal the underlying community structure [24]. In
addition, the algorithms only based on local information are generally fast but of lower accuracy,
while the ones making use of the knowledge of
global topology are of higher accuracy yet higher computational complexity [17]. Can we find a good
tradeoff that provides high quality predictions while requiring light computation?

In this paper, we empirically investigate a simple framework of link prediction on the basis of node
similarity. Although the framework is simple, it opens a rich space for exploration since the design
of similarity measures is challenging and can be related to very complicated physics dynamics and
mathematical theory, such as random walk [25] and counting problem of spanning trees [26]. Here we
concentrate on local-information-based similarities. We compare nine well-known local measures on six
real networks, and the results indicate that the simplest measure, namely common neighbors, has the
best overall performance, which is in accordance with the empirical results reported in Ref. [17].
Motivated by the resource allocation process in transportation networks, we next propose a new
similarity measure, which performs obviously better than common neighbors, while requiring no more
information and computational time. Furthermore, it is found that many links get the same scores
under local similarity measures, just like the degeneracy of energy levels. We therefore design a new
measure using the information of the next nearest neighbors, which can break the "degenerate states"
and thus remarkably enhance the algorithmic accuracy. Finally, we outline some future interests in
this direction.

2 Method

Consider an undirected simple network $G(V, E)$, where $V$ is the set of nodes and $E$ is the set of
links. Multiple links and self-connections are not allowed. For each pair of nodes, $x, y \in V$,
every algorithm referred to in this paper assigns a score $s_{xy}$. This score can be viewed as a
measure of similarity between nodes x and y; hereinafter, we do not distinguish similarity and score.
All the nonexistent links are sorted in decreasing order according to their scores, and the links at
the top are most likely to exist.

To test the algorithmic accuracy, the observed links, $E$, are randomly divided into two parts: the
training set, $E^T$, is treated as known information, while the probe set, $E^P$, is used for testing
and no information in the probe set is allowed to be used for prediction. Clearly, $E = E^T \cup E^P$
and $E^T \cap E^P = \emptyset$. In this paper, the training set always contains 90% of links, and the
remaining 10% of links constitute the probe set. We use a standard metric, the area under the receiver
operating characteristic (ROC) curve [27], to quantify the accuracy of prediction algorithms. In the
present case, this metric can be interpreted as the probability that a randomly chosen missing link
(a link in $E^P$) is given a higher score than a randomly chosen nonexistent link (a link in
$U \setminus E$, where $U$ denotes the universal set). In the implementation, among $n$ independent
comparisons, if there are $n'$ times the missing link having a higher score and $n''$ times the
missing link and nonexistent link having the same score, we define the accuracy as:

$AUC = \frac{n' + 0.5 n''}{n}$. (1)
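The evaluation protocol of this section can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `split_links` and `auc_by_sampling`, the default of 10000 comparisons, and the fixed random seeds are all choices made for this example.

```python
import random

def split_links(links, probe_fraction=0.1, seed=0):
    """Randomly divide the observed links into (training set, probe set)."""
    rng = random.Random(seed)
    shuffled = list(links)
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    return shuffled[n_probe:], shuffled[:n_probe]

def auc_by_sampling(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate Eq. (1), AUC = (n' + 0.5 n'') / n, from n random comparisons."""
    rng = random.Random(seed)
    higher = ties = 0
    for _ in range(n):
        s_missing = score(*rng.choice(probe_links))
        s_nonexistent = score(*rng.choice(nonexistent_links))
        if s_missing > s_nonexistent:
            higher += 1   # contributes to n'
        elif s_missing == s_nonexistent:
            ties += 1     # contributes to n''
    return (higher + 0.5 * ties) / n
```

Here `score` is any function of a node pair, and `nonexistent_links` is assumed to be a list of node pairs that are not in the observed link set. A score function that returns the same value for every pair makes every comparison a tie, so the estimate is exactly 0.5.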
If all the scores are generated from an independent and identical distribution, the accuracy should
be about 0.5. Therefore, the degree to which the accuracy exceeds 0.5 indicates how much better the
algorithm performs than pure chance.

3 Data

In this paper, we consider six representative networks drawn from disparate fields: (i) PPI. A
protein-protein interaction network containing 2617 proteins and 11855 interactions [28]. Although
this network is not well connected (it contains 92 components), most of the nodes belong to the giant
component, whose size is 2375. (ii) NS. A network of coauthorships between scientists who are
themselves publishing on the topic of networks [29]. The network contains 1589 scientists, 128 of
which are isolated. Here we do not consider those isolated nodes. The connectivity of NS is not good;
actually, NS consists of 268 connected components, and the size of the largest connected component is
only 379. (iii) Grid. An electrical power grid of the western US [19], with nodes representing
generators, transformers and substations, and edges corresponding to the high voltage transmission
lines between them. (iv) PB. A network of the US political blogs [30]. The original links are
directed; here we treat them as undirected links. (v) INT. The router-level topology of the Internet,
which is collected by the Rocketfuel Project [31]. (vi) USAir. The network of the US air
transportation system, which contains 332 airports and 2126 airlines [32].

Table 1. The basic topological features of six example networks. $N$ and $M$ are the total numbers of
nodes and links, respectively. $N_c$ denotes the size of the giant component; for example, the entry
2375/92 in the first line means that the network has 92 components and the giant component consists
of 2375 nodes. $e$ is the network efficiency [33], defined as
$e = \frac{2}{N(N-1)} \sum_{x,y \in V, x \neq y} d_{xy}^{-1}$, where $d_{xy}$ is the shortest distance
between x and y, and $d_{xy} = +\infty$ if x and y are in two different components. $C$ and $r$ are
the clustering coefficient [19] and assortative coefficient [21], respectively. Nodes with degree 1
are excluded from the calculation of the clustering coefficient. $H$ is the degree heterogeneity,
defined as $H = \frac{\langle k^2 \rangle}{\langle k \rangle^2}$, where $\langle k \rangle$ denotes
the average degree.

Nets    N      M       N_c       e       C        r       H
PPI     2617   11855   2375/92   0.180   0.387    0.461   3.73
NS      1461   2742    379/268   0.016   0.878    0.462   1.85
Grid    4941   6594    4941/1    0.063   0.107    0.003   1.45
PB      1224   19090   1222/2    0.397   0.361   -0.079   3.13
INT     5022   6258    5022/1    0.167   0.033   -0.138   5.50
USAir   332    2126    332/1     0.406   0.749   -0.208   3.46

Table 1 summarizes the basic topological features of those networks. Brief definitions of the
monitored topological measures can be found in the table caption; for more details, please see the
review articles [1,2,3,4,5]. We here give a few remarks on the numbers which may be unexpected to
some readers: (i) It is well known that in the protein-protein interaction networks, links between
highly connected proteins are systematically suppressed, while those between highly-connected and
low-connected pairs
are favored [34]. That is to say, the assortative coefficient should be negative for PPI (for
example, as reported in Ref. [21], the yeast PPI network has an assortative coefficient of -0.156);
however, in the present network, the assortative coefficient is strongly positive, at 0.461. This is
because the data set used here [28] is determined from functional interactions and not from physical
interactions. More detailed discussion can be found in Ref. [35]. (ii) The extremely large clustering
coefficient of NS is due to the specific constructing rule of collaboration networks, namely all the
participants in an act are fully connected. Relevant discussion can be found in Appendix B of
Ref. [36].
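For readers who wish to check the sign of the assortative coefficient on their own data, the following is a minimal sketch. It assumes that r is the Pearson correlation coefficient between the degrees found at the two ends of a link, with every undirected link counted in both directions, which is the usual definition; the function name `assortative_coefficient` and the edge-list input are illustrative choices, not taken from the paper.

```python
from collections import Counter

def assortative_coefficient(edges):
    """Pearson correlation of endpoint degrees over an undirected edge list."""
    degree = Counter()
    for x, y in edges:
        degree[x] += 1
        degree[y] += 1
    # Each undirected link contributes both (k_x, k_y) and (k_y, k_x).
    pairs = [(degree[x], degree[y]) for x, y in edges]
    pairs += [(ky, kx) for kx, ky in pairs]
    m = len(pairs)
    mean = sum(kx for kx, _ in pairs) / m
    cov = sum((kx - mean) * (ky - mean) for kx, ky in pairs) / m
    var = sum((kx - mean) ** 2 for kx, _ in pairs) / m
    if var == 0:
        raise ValueError("undefined when all endpoint degrees are equal")
    return cov / var
```

Under this definition a star graph, in which a hub is linked only to degree-1 nodes, is perfectly disassortative and gives r = -1.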



Table 2. Accuracies of algorithms, measured by the area under 
the ROC curve. Each number is obtained by averaging over 10 
implementations with independent random divisions into training
set and probe set. The abbreviations, CN, Salton, Jaccard,
Sørensen, HPI, HDI, LHN, PA, and AA, stand for Common
Neighbors, Salton Index, Jaccard Index, Sørensen Index, Hub
Promoted Index, Hub Depressed Index, Leicht-Holme-Newman 
Index, Preferential Attachment, and Adamic-Adar Index, re- 
spectively. The entries corresponding to the highest accuracies 
among these nine measures are emphasized by black. RA and 
LP are the abbreviations of Resource Allocation Index and 
Local Path Index, proposed in Section 5 and Section 6 respec- 
tively. The parameter for LP, e, is fixed as 10"''. 



4 Comparison of Nine Similarity Measures
Based on Local Information

In this section, we compare prediction accuracies of nine similarity measures. All these measures
are based on the local structural information contained in the training set. We first give a brief
introduction of each measure as follows.

(i) Common Neighbors. For a node x, let $\Gamma(x)$ denote the set of neighbors of x. In common
sense, two nodes, x and y, are more likely to have a link if they have many common neighbors. The
simplest measure of this neighborhood overlap is the direct count, namely

$s_{xy} = |\Gamma(x) \cap \Gamma(y)|$. (2)

Measures   PPI     NS      Grid    PB      INT     USAir
CN         0.889   0.933   0.590   0.925   0.559   0.937
Salton     0.869   0.911   0.585   0.874   0.552   0.898
Jaccard    0.888   0.933   0.590   0.882   0.559   0.901
Sørensen   0.888   0.933   0.590   0.881   0.559   0.902
HPI        0.868   0.911   0.585   0.852   0.552   0.857
HDI        0.888   0.933   0.590   0.877   0.559   0.895
LHN        0.866   0.911   0.585   0.772   0.552   0.758
PA         0.828   0.623   0.446   0.907   0.464   0.886
AA         0.888   0.932   0.590   0.922   0.559   0.925
RA         0.890   0.933   0.590   0.931   0.559   0.955
LP         0.939   0.938   0.639   0.936   0.632   0.900

(ii) Salton Index. The Salton index [37] is defined as

$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\sqrt{k(x) \times k(y)}}$, (3)

where $k(x) = |\Gamma(x)|$ denotes the degree of x. The Salton index is also called cosine similarity
in the literature.
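Equations (2) and (3) can be illustrated with a short sketch. The adjacency-set representation (`neighbors`, a dict mapping each node to the set of its neighbors), the function names, and the toy network are assumptions made for this example; giving a Salton score of 0 to a pair that involves an isolated node is likewise a convention chosen here.

```python
from math import sqrt

def common_neighbors(neighbors, x, y):
    """Eq. (2): number of neighbors shared by x and y."""
    return len(neighbors[x] & neighbors[y])

def salton_index(neighbors, x, y):
    """Eq. (3): common neighbors divided by sqrt(k(x) * k(y))."""
    kx, ky = len(neighbors[x]), len(neighbors[y])
    if kx == 0 or ky == 0:
        return 0.0  # convention assumed here for isolated nodes
    return common_neighbors(neighbors, x, y) / sqrt(kx * ky)

# Toy network: "h" is a hub of degree 4; "a", "b" and "e" have degree 2.
neighbors = {
    "h": {"a", "b", "c", "d"},
    "e": {"a", "b"},
    "a": {"h", "e"},
    "b": {"h", "e"},
    "c": {"h"},
    "d": {"h"},
}
```

In this toy network the unconnected pairs (h, e) and (a, b) both have two common neighbors, so CN cannot separate them, while the Salton index ranks (a, b) higher because the degree of the hub enters the denominator for (h, e).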
(iii) Jaccard Index. This index was proposed by Jaccard [38] over a hundred years ago, and is
defined as

$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{|\Gamma(x) \cup \Gamma(y)|}$. (4)

(iv) Sørensen Index. This index is mainly used for ecological community data [39], and is defined as

$s_{xy} = \frac{2 |\Gamma(x) \cap \Gamma(y)|}{k(x) + k(y)}$. (5)

(v) Hub Promoted Index. This index is proposed for quantifying the topological overlap of pairs of
substrates in metabolic networks [40], defined as

$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\min\{k(x), k(y)\}}$. (6)

Under this measure, the links adjacent to hubs (here, the term "hub" represents a node with very
large degree) are probably assigned high scores since the denominator is determined by the lower
degree only.

(vi) Hub Depressed Index. Analogously to the above index, we consider a measure with the opposite
effect on hubs for comparison, which is defined as

$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{\max\{k(x), k(y)\}}$. (7)

(vii) Leicht-Holme-Newman Index. This index gives high similarity to node pairs that have many
common neighbors compared not to the possible maximum, but to the expected number of such neighbors
[41]. It is defined as

$s_{xy} = \frac{|\Gamma(x) \cap \Gamma(y)|}{k(x) \times k(y)}$, (8)

where the denominator, $k(x) \times k(y)$, is proportional to the expected number of common neighbors
of nodes x and y in the configuration model [32].

(viii) Preferential Attachment. The mechanism of preferential attachment can be used to generate
evolving scale-free networks (i.e., networks with power-law degree distributions), where the
probability that a new link is connected to the node x is proportional to $k(x)$ [20]. A similar
mechanism can also lead to scale-free networks without growth [43], where at each time step, an old
link is removed and a new link is generated. The probability that this new link connects x and y is
proportional to $k(x) \times k(y)$. Motivated by this mechanism, a corresponding similarity index can
be defined as

$s_{xy} = k(x) \times k(y)$, (9)

which has already been suggested as a proximity measure [33], as well as been used to quantify the
functional significance of links subject to various network-based dynamics, such as percolation
[45], synchronization [46] and transportation [47]. Note that this index requires less information
than all others, namely it does not need to know the neighborhood of each node. As a consequence, it
also has the minimal computational complexity.

(ix) Adamic-Adar Index. This index refines the simple counting of common neighbors by assigning the
lower-connected neighbors more weights [48], and is defined as

$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{\log k(z)}$. (10)

We present the algorithmic accuracies for the six example networks in Table 2, with those entries
corresponding to the highest accuracies being emphasized in black. To our surprise, the simplest
measure, common neighbors,
performs the best. This result is in accordance with the one reported in Ref. [17] for social
collaboration networks. Besides CN, the Adamic-Adar index performs the next best, for its accuracies
are always close to the best one, while others, such as the Jaccard index, Sørensen index and HDI,
perform far worse in the cases of PB and USAir.

Note that the first seven measures, from CN to LHN, differ only in their denominators. If all the
nodes have pretty much the same degrees, corresponding to a very small H, then the difference among
those measures becomes insignificant. In addition, for a given network, if its clustering coefficient
is very small, whether two nodes have common neighbors plays the most important role, while the
denominator is less important. In a word, remarkable difference among those seven measures can be
found only if the monitored network simultaneously has large clustering coefficient and large degree
heterogeneity, such as PPI, PB and USAir. As shown in Table 2, the performances of those seven
algorithms on PB and USAir are obviously different, but for PPI, they are more or less the same. A
possible reason is that PPI is a very assortative network (i.e., r = 0.461), and thus two nodes of a
link tend to have similar degrees, which reduces the difference in denominators.

The preferential attachment has the worst overall performance; however, we are interested in it for
it requires the minimal information. One may intuitively think that PA will give good predictions for
assortative networks, while performing badly for disassortative networks. However, no obvious
correlation between assortative coefficient and algorithmic accuracy based on PA can be found from
our numerical results. The reason is twofold. Firstly, links between pairs of large-degree nodes
contribute positively to the assortative coefficient and are assigned high scores by PA, while links
between pairs of small-degree nodes also contribute positively to the assortative coefficient but
are disfavored by PA. Actually, the assortative coefficient is an integrated measure involving many
ingredients, and there is no simple relation between this measure and the performance of PA.
Secondly, the assortative coefficient itself is very sensitive to the degree sequence, and a network
of higher degree heterogeneity tends to be disassortative [49]. Therefore, this single parameter
cannot reflect the detailed linking patterns of networks. Clearly, if the large-degree nodes are very
densely connected to each other, and the small-degree nodes are rarely connected to each other, PA
will perform relatively well. The former relates to the so-called rich-club phenomenon [50], and we
have checked that PB and USAir exhibit obvious rich-club phenomenon with respect to their randomized
versions (we followed the method proposed by Colizza et al. [51], and they have already demonstrated
the presence of rich-club phenomenon in the air transportation network). In addition, in USAir, more
than 40% of nodes are very small local airports, with degrees no more than 3. A local airport usually
connects to a nearby central airport and a very few hubs; the direct links between two local airports
are rarely found. This topological feature is also favored by PA. As shown in Table 2, PA gives
relatively good predictions for PB and USAir, in accordance with the above
discussion. Note that all the other eight measures will automatically assign zero score to a pair of
nodes located in different components. Therefore, PA performs badly when the network consists of many
components. This is the very reason why PA gives very bad predictions for NS, although NS has a clear
rich-club phenomenon. We also note that PA performs even worse than pure chance for the Internet at
router level and the power grid. In these two networks, the nodes have well-defined positions and the
links are physical lines. Actually, geography plays a significant role and the links with very long
geographical distances are rare (the empirical analysis of spatial dependence of links in the
Internet can be found in Ref. [52], and the absence of clustering-degree correlation in the
router-level Internet and power grid can be considered as an indicator of a strong geographical
constraint [53]). PA cannot take into account the effect of geographical localization at all. As
local centers, the large-degree nodes have longer geographical distances to each other than average;
correspondingly, they also have less probability to directly connect to each other. Actually, these
two networks exhibit the anti-rich-club phenomenon, that is, the link density among very-large-degree
nodes is even lower than in the randomized versions. This anti-rich-club effect leads to the bad
performance of PA. In contrast, although USAir has well-defined geographical positions of nodes, its
links are not physical. Empirical data demonstrated that the air transportation networks show an
inverse relation between clustering coefficient and degree [51], and the number of airline flights is
not sensitive to the geographical distance within the range of about 2000 kilometers [SS]. As a final
remark, comparing Eq. (8) and Eq. (9), LHN is, to some extent, inverse to PA; therefore when PA
performs badly, LHN will give relatively good predictions, and vice versa.

5 Similarity Measure Based on Resource
Allocation

Except PA, all the others introduced in the last section are neighborhood-based measures. Although
they are simple and mathematically graceful, they are not tightly related to any physical processes.
In this section, motivated by the resource allocation process taking place in networks [55], we
propose a new similarity measure, which has overall higher accuracy than all the measures mentioned
in Section 4.

Consider a pair of nodes, x and y, which are not directly connected. The node x can send some
resource to y, with their common neighbors playing the role of transmitters. In the simplest case, we
assume that each transmitter has a unit of resource, and will distribute it equally to all its
neighbors. The similarity between x and y can be defined as the amount of resource y receives from x,
which is:

$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k(z)}$. (11)

Clearly, this measure is symmetric, namely $s_{xy} = s_{yx}$.

The algorithmic accuracies on the six example networks are presented in Table 2, with RA the
abbreviation of resource allocation. Compared with all the nine measures introduced in Section 4, RA
performs the best, especially for the networks (i.e., PB and USAir) with large clustering
coefficient, high degree heterogeneity and absence of strongly assortative linking pattern. It is
observed that RA exhibits particularly good performance on USAir. The reason may be that the resource
allocation process was originally proposed to explain the nonlinear correlation between
transportation capacity and connectivity of each airport [54,57,58].

Note that, although resulting from different motivations [48,56], the Adamic-Adar index and resource
allocation index have very similar forms; indeed, they both depress the contributions of common
neighbors with high degrees. The difference between $\frac{1}{\log k(z)}$ and $\frac{1}{k(z)}$ (see
Eq. (10) and Eq. (11)) is insignificant if the degree, $k(z)$, is small, while it is great if $k(z)$
is large. Therefore, when the average degree is very small, the prediction results of AA and RA are
very close, while for the networks of large average degree, such as PB and USAir, the results are
clearly different and RA performs better, which implies that the punishment on large-degree common
neighbors by AA is insufficient.

RA can be extended to the asymmetric case. Assume a unit of resource is located in x, which will be
equally sent to all of x's neighbors, each of which will equally distribute the received resource one
step further to all its neighbors. The amount of resource a node y receives can be considered as the
importance of y in x's sense, denoted as

$s_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k(x) k(z)}$. (12)

In this case, $s_{xy} \neq s_{yx}$. This idea has already found its applications in a personalized
recommendation algorithm for bipartite user-object networks [59,60].

6 Improving Algorithmic Accuracy by
Breaking the Degenerate States

The neighborhood-based measures require only the information of the nearest neighbors, and therefore
have very low computational complexity. However, the information usually seems insufficient, and the
probability that two node pairs are assigned the same score is high. That is to say, the
neighborhood-based similarities are less distinguishable from each other. If we consider the score
assigned to a node pair as its energy, then many node pairs crowd into very few energy levels.
Taking INT as an example, there are more than $10^7$ node pairs, 99.59% of which are assigned zero
score by CN. Of all the node pairs having scores higher than 0, 91.11% are assigned score 1, and
4.48% are assigned score 2. Using a little bit more information involving the next nearest neighbors
may break the "degenerate states" and make the scores more distinguishable. Denote by $A$ the
adjacency matrix, where $A_{xy} = 1$ if x and y are directly connected, and $A_{xy} = 0$ otherwise.
Obviously, $(A^2)_{xy}$ is the number of common neighbors of nodes x and y, which is also equal to
the number of different paths with length 2 connecting x and y. And if x and y are not directly
connected (this is the case we are interested in), $(A^3)_{xy}$ is equal to the number of different
paths with length 3 connecting x and y. The information contained in $A^3$ can be used to break the
degenerate states, and thus we define a new measure as

$S = A^2 + \epsilon A^3$, (13)

where $S$ denotes the similarity matrix and $\epsilon$ is a free parameter. We call it the Local Path
(LP) index for it makes use of the information of local paths with lengths 2 and 3. Clearly, LP
degenerates to CN when $\epsilon = 0$. Here, the information in $A^3$ is only used to break the
degenerate states, therefore $\epsilon$ should be a very small number close to zero (of course, given
a network, one can tune $\epsilon$ to find its optimal value corresponding to the highest accuracy;
however, this optimal value is different for different networks, and a parameter-dependent measure is
less practical in dealing with huge-size networks since the tuning process may take too long). In the
real implementation, we directly count the number of different paths with length 3, which is much
faster than the matrix multiplication, and thus Eq. (13) is also based on local calculation.

The algorithmic accuracies on the six example networks are presented in Table 2, where this measure
is denoted by LP and the parameter $\epsilon$ is fixed at the small positive value given in the
caption of Table 2. It is encouraging to see that the accuracy, except for USAir, can be largely
enhanced by LP. In USAir, the large-degree nodes are densely connected and share many common
neighbors. Some links among large-degree nodes are removed into the probe set. Even without the
contribution of $\epsilon A^3$, those links are assigned very high scores, thus the additional item,
$\epsilon A^3$, changes little of their relative positions. Consider two small local airports, x and
y, which are connected to their local central airports, x' and y'. Of course, many hubs are common
neighbors of x' and y', and x' and y' may be directly connected. If the link (x, x') is removed, the
similarities between x and other nodes are all zero for both CN and LP. If (x, x') exists, by LP, the
similarities $s_{xy'}$ (by x-x'-hub-y'), $s_{xy}$ (by x-x'-y'-y), and $s_{xh}$, where h represents a
hub node (by x-x'-hub-h and/or x-x'-y'-h), are positive due to the contributions of paths with length
3. There are many links connecting small local airports and local centers, some of which are removed,
and the others are kept in the training set. According to the above discussion, the removed links
have lower scores than the nonexistent links due to the additional item $\epsilon A^3$. In a word,
the very specific structure of USAir (the hierarchical organization consisting of hubs, local centers
and small local airports) makes LP worse than the simple CN. In this specific case, we can break the
degenerate states in the opposite direction by setting $\epsilon$ to a small negative value, which
leads to an accuracy of 0.945, higher than the one by CN, 0.937.

7 Conclusion and Discussion

In this paper, we empirically compared some link prediction algorithms based on node similarities.
All the similarity measures discussed here, including the two newly proposed ones, can be obtained by
local calculations. Numerical results on the nine well-known measures indicate that: (i) The simplest
measure, common neighbors, performs the best, and the Adamic-Adar index is the second; (ii)
Remarkable difference among these measures, excluding the Adamic-Adar index and the preferential attach-
ment, can be observed only if the monitored network is information [61]. A similar idea has also been adopted in

with large clustering coefficient, high degree heterogene- the network-based traffic dynamics, where the informa- 

ity, and absence of strongly assortative linking pattern; tion of the next nearest neighbors can sharply enhance 

(iii) The preferential attachment performs relatively good the traffic efficiency compared with the case in which only 

if the monitored network has the rich-club phenomenon. the information of the nearest neighbors is known [62] . 

We proposed a new measure, RA, motivated by the re- Although the framework adopted here is very simple, 
source allocation process, which is equivalent to the one- it opens a rich space for investigation since in principle, all 
step random walk starting from the common neighbors, algorithms can be embedded into this framework differing 
This measure has a similar form to the Adamic-Adar in- only in the similarity measures. Besides ones discussed in 
dex, but performs better, especially for the networks with this paper, a number of similarities are based on the global 
high average degrees. We guess RA is particularly suitable structural information, such as the average commute time 
for link prediction of transportation networks, whose va- of random walk [25j . the number of spanning trees em- 
lidity needs further evidence from more empirical results, bedding a given node pair [26', the pseudoinverse of the 
We here strongly recommend this measure to relevant ap- Laplacian matrix [63], and so on. Some other similarity 
plications and theoretical analyses, not only for its good measures are even more complicated, depending on pa- 
performance, but also for its simplicity and grace. rameters. These include the Katz index |64| and its variant 

r 1 ii i T 1 -1 1411 . the transferring similarity 165 , the PageRank index 

i'urthermore, we found that many Imks are assigned ' — ' ^ l » 

1 J , , • ii • r 1661 . and so on. These measures may give better predic- 

same scores based on the local measures using the mtor- ' — " ° ^ 

r,i i • 11 1 1 -i i- r tions than the local ones, however, the calculation of such 

matron of the nearest neighbors only. Exploitation ot some ' ' 

, , • r r ii i i -11 measures, including determination of the optimal param- 

additional information of the next nearest neighbors can > o 

^ , 1 , , , 11 ii 1 eters for specific networks, is of high complexity, and thus 

therefore break the degenerate states and enhance the al- ^ > o i- j : 

.,, . T 1 1- i- ii 1 -ii infeasible for huge-size networks. Anyway, up to now, we 

gorithmic accuracy, m real applications, the algorithms ° j ji t- ? 

, , 1 1 1 1 1 i- 11 • i r ii lack systematic comparison and clear understanding of the 

based on global calculations may be less emcient for they j t- a 

, ^. 1/1 1 -1 ii 1 performances of these measures, which is set as our future 
require long time and/ or huge memory, while the algo- 
rithms only exploited very local information may be less ^0'"ks. 

effective for their low accuracies. A properly designed al- Empirical analysis on more real networks as well as 

gorithm can provide a good tradeoff just like the LP in- more known and newly proposed similarity measures is 

dex presented in this paper. Indeed, it is shown recently very valuable for building up knowledge and experience, 

that the LP index provides competitively accurate pre- and we can expect a clear picture of this issue can be com- 

dictions compared with the indices making use of global pleted by putting together of many fragments from respec- 
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tive empirical studies. However, the empirical results may 9. A. Grabowski, N. Kruszewska, R. A. Kosiriski, Phys. Rev. 
be not clear at all since many unknown and uncontrollable E 78, 066110 (2008). 

ingredients are always mixed together in real networks. An 10. H.-B. Hu, X.-F. Wang, Europhys. Lett. 86, 18003 (2009). 

alternative way is to build artificial network models with 11. L. Getoor, C. P. Diehl, Link Mining: A Survey, In Pro- 
controllable topological features, and to compare the pre- ceedmg of the ACM SIGKDD International Conference on 
diction algorithms on these models (see Ref. [SI] about Knowledge Discovery and Data Mining (ACM Press, New 
the comparison of link prediction algorithms on modeled York, 2005). 

networks with controllable density and noise strength). 12. M. Graven, D. DiPasquo, D. Freitag, A. McCallum, T. 

Mitchell, K. Nigam, S. Slattery, Artificial Intelligence 118, 
This work is partially supported by the Swiss National Sci- 69 (2000). 

ence Foundation (Project 205120-113842) and Physics of Risk 13 a. Popescul, L. Ungar, Statistical relational laming for link 
through project C05.0148. T.Z. acknowledges the National Nat- prediction, In Workshop on Learning Statistical Models from 
ural Science Foundation of China (Grant Nos. 10635040 and Relational Data (ACM Press, New York, pp. 81-90, 2003). 
60744003). L.L. acknowledges the National Scholarship Fund 3 ^^^j^^^^ ^ _p ^^^^^ p ^^^^^^^ ^ j^^^^^^ ^^^^ 

of China Scholarship Council. ^^^^ relational data. In Proceeding of Neural Information 
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